-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
o365input: Restart after fatal error #21258
Conversation
Update the o365input to restart the input after a fatal error is encountered, for example an authentication token refresh error or a parsing error. This enables the input to be more resilent against transient errors. Before this patch, the input would index an error document and terminate. Now it will index an error and restart after a fixed timeout of 5 minutes.
Pinging @elastic/siem (Team:SIEM) |
@urso can you have a quick look? Is there a better way to "restart" an input? |
Thanks for taking this and for resolving. Can you please let me know any plans when it will be roll out to new version |
@27bharath this fix will be available in the next versions, 7.9.3 and 7.10.0 |
* upstream/master: feat: prepare release pipelines (elastic#21238) Add IP validation to Security module (elastic#21325) Fixes for new 7.10 rsa2elk datasets (elastic#21240) o365input: Restart after fatal error (elastic#21258) Fix panic in cgroups monitoring (elastic#21355) Handle multiple upstreams in ingress-controller (elastic#21215) [CI] Fix runbld when workspace does not exist (elastic#21350) [Filebeat] Fix checkpoint (elastic#21344) [CI] Archive build reasons (elastic#21347) Add dashboard for pubsub metricset in googlecloud module (elastic#21326) [Elastic Agent] Allow embedding of certificate (elastic#21179) Adds a default for failure_cache.min_ttl (elastic#21085) [libbeat] Disk queue implementation (elastic#21176)
Update the o365input to restart the input after a fatal error is encountered, for example an authentication token refresh error or a parsing error. This enables the input to be more resilient against transient errors. Before this patch, the input would index an error document and terminate. Now it will index an error and restart after a fixed timeout of 5 minutes. (cherry picked from commit 8716d98)
Update the o365input to restart the input after a fatal error is encountered, for example an authentication token refresh error or a parsing error. This enables the input to be more resilient against transient errors. Before this patch, the input would index an error document and terminate. Now it will index an error and restart after a fixed timeout of 5 minutes. (cherry picked from commit 8716d98)
Update the o365input to restart the input after a fatal error is encountered, for example an authentication token refresh error or a parsing error. This enables the input to be more resilient against transient errors. Before this patch, the input would index an error document and terminate. Now it will index an error and restart after a fixed timeout of 5 minutes. (cherry picked from commit 8716d98)
Update the o365input to restart the input after a fatal error is encountered, for example an authentication token refresh error or a parsing error. This enables the input to be more resilient against transient errors. Before this patch, the input would index an error document and terminate. Now it will index an error and restart after a fixed timeout of 5 minutes. (cherry picked from commit 8716d98)
publisher.Publish(event, nil) | ||
ctx.Logger.Errorf("Input failed: %v", err) | ||
ctx.Logger.Infof("Restarting in %v", failureRetryInterval) | ||
time.Sleep(failureRetryInterval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can block proper filebeat shutdown for 5min. Better use timed.Wait
that is cancellable from go-concert.
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't stop working once a fatal error was found. This updates the restart delay to use a cancellation-context-aware method so that the input doesn't block Filebeat termination.
PR #21258 introduced a restart mechanism for o365input so that it didn't stop working once a fatal error was found. This updates the restart delay to use a cancellation-context-aware method so that the input doesn't block Filebeat termination.
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't stop working once a fatal error was found. This updates the restart delay to use a cancellation-context-aware method so that the input doesn't block Filebeat termination. (cherry picked from commit 1abe97b)
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't stop working once a fatal error was found. This updates the restart delay to use a cancellation-context-aware method so that the input doesn't block Filebeat termination. (cherry picked from commit 1abe97b)
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't stop working once a fatal error was found. This updates the restart delay to use a cancellation-context-aware method so that the input doesn't block Filebeat termination. (cherry picked from commit 1abe97b)
elastic#21386) Update the o365input to restart the input after a fatal error is encountered, for example an authentication token refresh error or a parsing error. This enables the input to be more resilient against transient errors. Before this patch, the input would index an error document and terminate. Now it will index an error and restart after a fixed timeout of 5 minutes. (cherry picked from commit c723c1e)
PR elastic#21258 introduced a restart mechanism for o365input so that it didn't stop working once a fatal error was found. This updates the restart delay to use a cancellation-context-aware method so that the input doesn't block Filebeat termination. (cherry picked from commit f2ab428)
What does this PR do?
Updates
o365input
to restart the input after a fatal error is encountered, for example an authentication token refresh error or a parsing error.This enables the input to be more resilient against errors.
Before this patch, the input would index an error document and terminate. Now it will index an error and restart after a fixed timeout of 5 minutes.
Why is it important?
Some users are reporting that the
o365
module stops ingesting events after some days. In all cases it's been observed that the input terminated at some point due to errors contacting the Azure authentication server to refresh a token.Checklist
[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files[ ] I have added tests that prove my fix is effective or that my feature worksCHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Author's Checklist
How to test this PR locally
Testing the case of token refresh errors is difficult as they are refreshed once every ~12h. But the behavior can be tested by starting Filebeat without an internet connection.